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Abstract 

We study an efficient dynamic blind source separation algorithm of 
convolutive sound mixtures based on updating statistical information 
in the frequency domain, and minimizing the support of time domain 
dcmixing filters by a weighted least square method. The permutation 
and scaling indeterminacies of separation, and concatenations of signals 
in adjacent time frames are resolved with optimization of I 1 x 1°° norm 
on cross-correlation coefficients at multiple time lags. The algorithm is 
a direct method without iterations, and is adaptive to the environment. 
Computations on recorded and synthetic mixtures of speech and music 
signals show excellent performance. 
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1 Introduction 



Blind source separation (BSS) methods aim to extract the original source sig- 
nals from their mixtures based on the statistical independence of the source 
signals without knowledge of the mixing environment. The approach has 
been very successful for instantaneous mixtures. However, realistic sound 
signals are often mixed through a media channel, so the received sound 
mixtures are linear convolutions of the unknown sources and the channel 
transmission functions. In simple terms, the observed signals are unknown 
weighted sums of the signals and its delays. Separating convolutive mixtures 
is a challenging problem especially in realistic settings. 

In this paper, we study a dynamic BSS method using both frequency and 
time domain information of sound signals in addition to the independence 
assumption on source signals. First, the convolutive mixture in the time 
domain is decomposed into instantaneous mixtures in the frequency domain 
by the fast Fourier transform (FFT). At each frequency, the joint approx- 
imate diagonalization of eigen-matrices (JADE) method is applied. The 
JADE method collects second and fourth order statistics from segments of 
sound signals to form a set of matrices for joint orthogonal diagonalization, 
which leads to an estimate of de-mixing matrix and independent sources. 
However, there remain extra degrees of freedom: permutation and scaling of 
estimated sources at each frequency. A proper choice of these parameters is 
critical for the separation quality. Moreover, the large number of samples of 
the statistical approach can cause delays in processing. These issues are to 
be addressed by utilizing dynamical information of signals in an optimiza- 
tion framework. We propose to dynamically update statistics with newly 
received signal frames, then use such statistics to determine permutation in 
the frequency domain by optimizing an I 1 x l°° norm of channel to channel 
cross-correlation coefficients with multiple time lags. Though cross channel 
correlation functions and related similarity measure were proposed previ- 
ously to fix permutation [13], they allow cancellations and may not measure 
similarity as accurately and reliably as the norm (metric) we introduced 
here. The freedom in scaling is fixed by minimizing the support of the esti- 
mated de-mixing matrix elements in the time domain. An efficient weighted 
least square method is formulated to achieve this purpose directly in con- 
trast to iterative method in [T7j. The resulting dynamic BSS algorithm is 
both direct and adapted to the acoustic environment. Encouraging results 
on satisfactory separation of recorded sound mixtures are reported. 

The paper is organized as follows. In section 2, a review is presented 
on frequency domain approach, cumulants and joint diagonalization prob- 
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lems, and indeterminacies. Then the proposed dynamic method is presented, 
where objective functions of optimization, statistics update and efficient 
computations are addressed. Numerical results are shown and analyzed to 
demonstrate the capability of the algorithm to separate speech and music 
mixtures in both real room and synthetic environments. Conclusions are in 
section 3. 

2 Convolutive Mixture and BSS 

Let a real discrete time signal be s(k) = [s±(k), S2(k), • • • , s n (k)], k a discrete 
time index, such that the components Si(k) (i = 1, 2, • • • , n), are zero-mean 
and mutually independent random processes. For simplicity, the processing 
will divide s into partially overlapping frames of length T each. The in- 
dependent components are transmitted and mixed to give the observations 
Xi(k): 

n P-l 

x ii k ) =Y1 a ij{p) s ji k ~p), i = l,2,---,n; (2.1) 

j=l p=0 

where aij(p) denote mixing filter coefficients, the p-th element of the P- 
point impulse response from source i to receiver j. The mixture in (|2.1I) 
is convolutive, and an additive Gaussian noise may be added. The sound 
signals we are interested in are speech and music, both are non-Gaussian 
PQ. We shall consider the case of equal number of receivers and sources, 
especially n = 2. 

An efficient way to decompose the nonlocal equation (12. ip into local ones 
is by a T-point discrete Fourier transform (DFT) [2], Xj{u, t) = Y1t=o %j(t+ 
t) e - 2nJujT } where J = \J — 1, a; is a frequency index, u = 0, 1/T, • • • , (T — 
1)/T, t the frame index. Suppose T > P, and extend aij{p) to all p G 
[0, T — 1] by zero padding. Let Hij(u) denote the matrix function obtained 
by T-point DFT of aij(p) in p, Sj(u, t) the T-point DFT of Sj(k) in the t-th 
frame. If P <C T, then to a good approximation |17j : 

X(w,t)f*H(u;)S(u,t), (2.2) 

where X = [X\, ■ ■ ■ , X n ] Tr , S = [si, ■ ■ ■ , s n ] Tr , Tr is short for transpose. 
The components of S remains independent of each other, the problem is 
converted to a blind separation of instantaneous mixture in (I2.2p . Note 
that P is on the order of 40 to 50 typically, while T is 256 or 512, so the 
assumption P <C T is reasonable. 
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2.1 Instantaneous Mixture and JADE 

Let us briefly review an efficient and accurate method, so called joint approx- 
imate diagonalization eigen-matrices (JADE) |6j for BSS of instanteneous 
mixture. There are many other approaches in the literature [I], e.g. info- 
max method pQ which is iterative and based on maximizing some informa- 
tion theoretical function. JADE is essentially a direct method for reducing 
covariance. We shall think of S as a random function of t, and suppress oj de- 
pendence. First assume that by proper scaling E^Sj^)! 2 ] = 1, j = 1, • • • , n. 
It follows from independence of sources that (' conjugate transpose): 

E[S(t)S(t)'] = I n , R x = E[X(t)X(t)'] = HH\ (2.3) 

the latter identity is a factorization of the Hermitian covariance matrix of 
the mixture. However, there is non-uniqueness in the ordering and phases 
of columns of H. Suppose that (1) the mixing matrix H is full rank; (2) the 
Sj(t)'s are independent at any t; (3) the process S(t) is stationary. Let W 
be a matrix such that I n = WRxW = WHH'W' , W is called a whitening 
matrix. Then WH is an orthogonal matrix, denoted by U. Multiplying W 
from the left onto (|2.2p . one finds that: 

Z(t) = WX(t) = US(t). (2.4) 

The 4th order statistics are needed to determine U. The 4th-order cumulant 
of four mean zero random variables is: 

Cum[a, b, c, d] = E(abcd) - E(ab)E(cd) - E(ac)E(bd) - E(ad)E(bc), (2.5) 

which is zero if a, b, c, d split into two mutually independent groups. For 
source vector 5, Cum[5j, Sj, Sk, Si] = kuitiSijki, kurtj = Cum[5j, Si, Si, Si] 
is the kurtosis. If kurtj ^ 0, the i-th source is called kurtic. Kurtosis is zero 
for a mean zero Gaussian random variable. The last assumption of JADE 
is that (4) there is at most one non-kurtic source. 

Define cumulant matrix set Qz(M) from Z in (|2.4p as the linear span of 
the Hermitian matrices Q = (qij) satisfying (* complex conjugate): 

n 

lij = Cnm ( z ii Z j, z k,Zl)mi k , l<i,j<n, (2.6) 

k,l=l 

where matrix M = (rriij) = eie k , e/ being the unit vector with zero com- 
ponents except the i-th component equal to one. Equations (12. 4p and (|2.6p 



3 



imply that (u p is the p-th column of U): 

n 

Q = J2 ( kurt P u ' p Mu p) U P U ' P > v M > ( 2 - 7 ) 

p=l 

or Q = UDU' ', -D = diag(kurtiu' 1 Mui, • • • , kurt n u' n Mu n ). Hence, U is the 
joint diagonalizer of the matrix set Qz{M). Once U is so determined, the 
mixing matrix H = W^U. It can be shown [6] using identity (|2,7p that the 
joint diagonalizer of Qz(M) is equal to U up to permutation and phase, or 
up to a matrix multiplier P where P has exactly one unit modulus entry in 
each row and column. Such a joint diagonalizer is called essentially equal to 
U. 

The algorithm of finding the joint diagonalizer is a generalization of 
Jacobi method or Givens rotation method [9j. As the cumulant matri- 
ces are estimated in practice, exact joint diagonalizer may not exist, in- 
stead, an approximate joint diagonalizer, an orthogonal matrix V, is sought 
to maximize the quantity: C(V,B) = J2r=i |diag(V B r V)\ 2 , where B = 
.E>2, • • • i B n 2} is a set of basis (or eigen) matrices of Qz{M), |diag(^4)| 2 
is the sum of squares of diagonals of a matrix A. Maximizing C(V,B) is 
same as minimizing off diagonal entries, which can be achieved in a finite 
number of steps of Givens rotations. The costs of joint diagonalization is 
roughly n 2 times that of diagonalizing a single Hermitian matrix. 

Though stationarity is assumed for the theoretical analysis above, JADE 
turns out to be quite robust even when stationarity is not exactly satisfied 
for signals such as speech or music. 

2.2 Dynamic Method of Separating Convolutive Mixture 

For each frequency u, equation (|2.2p is a BSS problem of instantaneous 
mixtures. The speech or music signals in reality are stationary over short 
time scales and nonstationary over longer time scales, which depend on the 
production details. For speech signals, human voice is stationary for a few 
10 ms, and becomes non-stationary for a time scale above 100 ms due to 
envelope modulations [U [13]. The short time stationarity permits FFT to 
generate meaningful spectra in equation (|2.2p within each frame. For a 
sampling frequency of 16,000 Hertz, each frame of 512 points lasts 32 ms. 
The mixing matrix H may depend on t over longer time scales, denoted by 
H = H(u,t), unless the acoustic environment does not change as in most 
synthetic mixing. A demixing method with potential real time application 
should be able to capture the dynamic variation of mixing. 
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Our approach consists of four steps. Step I is to find an initialization 
for H(u,t). After receiving the initial ut frames of mixtures, compute their 
FFT and obtain X(u,t), t = 1,2, •■•,riT, to collect ut samples at each 
discrete frequency. For each u>, perform JADE, and estimate the mixing 
matrix denoted by Hq(lo). To ensure a good statistical estimate, tit is on 
the order of 80 to 100, and may be properly reduced later. 

Step I gives separated components of signals over all frequencies. How- 
ever, such JADE output has inderterminacies in amplitude, order and phase. 
This benign problem for instantaneous mixtures becomes a major issue when 
one needs to assemble the separated individual components. For example, 
the permutation mismatches across frequencies can degrade the quality of 
separation seriously. 

Step II is to use nonstationarity of signals to sort out a consistent 
order of separated signals in the frequency domain. Such a method for 
batch processing was proposed in [13]. A separation method requiring 
the entire length of the signal is called batch processing. The sorting al- 
gorithm of [13] proceeds as follows. (1) Estimate the envelope variation 
by a moving average over a number of frames (beyond stationarity time 
scale) for each separated frequency component. The envelope is denoted 
by Env(u;,i, i), where i is the index of separated components. (2) Com- 
pute a similarity measure equal to the sum of correlations of the envelopes 
of the separated components at each frequency. The similarity measure 
is sim(o;) = YU^j /o(Env(o>, t, i), Env(w, t, j)), where /?(•,•) is the normal- 
ized correlation coefficients (see (12. 9p ) involving time average over the en- 
tire signal length to approximate the ensemble average so the t dependence 
drops out. (3) Let u\ be the one with lowest similarity value where sep- 
aration is the best. The uj\ serves as a reference point for sorting. (4) 
At other frequencies (k = 2,3,---), find a permutation a to maxi- 
mize Ya=i /o(Env(wfc, t, cr(i)), X)*=i Env s (u;j, t, i)), among all permutations 
of 1, 2, • • • , n. Here Env s denotes the sorted envelopes in previous frequen- 
cies. (5) Permute the order of separated components at the k-th frequency 
bin according to a in step (4), and define Env^Wfc, t, i). Repeat (4) and (5) 
until k = T. 

We shall modify the above sorting method in three aspects. The first is 
to use segments of signal instead of the entire signal to compute statistics 
(correlations) to minimize delay in processing. The second is to use correla- 
tion coefficients of separated signals at un- equal times or multiple time lags 
in step (2) to better characterize the degree of separation. Moreover, we no- 
tice that the similarity measure of [13] as seen above is a sum of correlation 
coefficients of potentially both signs, and so can be nearly zero due to can- 
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cellations even though each term in the sum is not small in absolute value. 
We introduce an I 1 x l°° norm below to characterize more accurately chan- 
nel similarity by taking sum of absolute values of correlation coefficients and 
maximum of time lags. The third is to simplify the maximization problem 
on a to avoid comparing correlations with summed envelopes at all previ- 
ous frequencies. We also do not use envelopes of signals inside correlation 
functions. The reason is that the smoothing nature of envelope operation 
reduces the amount of oscillations in the signals and may yield correlation 
values less accurate for capturing the degree of independence. Specifically, 
let Si(uj,t) = a,i(u, t) ^^\ w 'v be the i-th separated signal at frequency u, 
where a,i(u,t) = \§i(uj,t)\, fa the phase functions, t the frame index. The 
correlation function of two time dependent signals over M frames is: 

M M M 

cov (a(u,t),b(u',t)) = M- 1 ^ a(uj,t)b*(uj' ,t)-M- 2 ^a{u:,t) J2 b *( u ' >*)> 

t=l t=i t=i 



and the (normalized) correlation coefficient is: 



(2. 



, / \ i / i cov(a(o>, t), b(u', t)) ,„ „. 

p(a{u, t),b(u>', t)) = v v I ' v " (2.9) 

\/cov(a(a;, t), a(ui, t)) cov(b(uj', t), b{Lo', t)) 

From speech production viewpoint, frequency components of a speech signal 
do not change drastically in time, instead are similarly affected by the motion 
of the speaker's vocal chords. The correlation coefficient is a natural tool for 
estimating coherence of frequency components of a speech signal. A similar 
argument may be applied to music signals as they are produced from cavities 
of instruments. 

Now with M = n T in (|£g]> . define 



C(u)=y2 max \p(\si(uj,t)\,\sj(uj,t - k)\)\, for u G [ujl , uu] 
k£{-K ,...,Ko} 

(2.10) 

with some positive integer Kq. Find uj\ between ujl and ujjj to minimize 
C(uj). With loi as reference, at any other uj, find the permutation a to 
maximize: 

n 

<7 = argmaxV max \p(\§i(ui, t)\, \s a a) (w, t - k)\)\. (2.11) 
^ ke{-K ,...,K } 

Notice that the objective functions in (|2.10p - (|2.1ip are exactly the I 1 x l°° 
norms over the indices and k. Multiple time lag index k is to accomodate 
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the translational invariance of sound quality to the ear. Maximizing over k 
helps to capture the correlation of the channels, and sum of i (j) reflects the 
total coherence of a vector signal. 

Step III fixes the scaling and phase indeterminacies in s(u>,t). Each row 
of the de-mixing matrix Hq 1 ^) may be multiplied by a complex number 
Xi(to) (i = 1, 2 • • • , n) before inverse FFT (ifft) to reconstruct demixing ma- 
trix h^°'(r) in the time domain. The idea is to minimize the support of 
each row of the inverse FFT by a weighted least square method. In other 
words, we shall select Aj's so that the entries of ifft(i? _1 )(r) = h^°'(r) are 
real and nearly zero if r > Q for some Q < T, Q as small as possible, T 
being the length of FFT. Smaller Q improves the local approximation, or 
accuracy of equation (|2.2p . To be more specific, using HqI(uj) to denote the 
i-th row vector of H^ 1 ^), we can explicitly write the equation to shorten 
the support of inverse FFT: 

m(X i (u;)H l(u))(T)=0 (2.12) 

in terms of the real and imaginary parts of \i(uj) for uj = 0, 1/T, (T— 1) /T. 
Those real and imaginary parts are the variables and the equations are linear. 
Now, we let r run from q to T — 1. If we want small support, q should be 
small, then there are more equations than unkowns. So we multiply a weight 
to each equation and minimize in the least square sense. Equation (12.121) 
for larger r is multiplied by a larger weight in the hope that the value of the 
left hand side of (12.121) will be closer to zero during the least square process. 
If we choose the weighting function to be the exponential function f3 T for 
some (3 > 1, then the above process can be mathematically written as 

T-l 

[Ai(0), \i((T - 1)/T)] = argmin ]T l^ifftlA,^)^- 1 ^))^)! 2 (2.13) 

T=q 

where HqI{oj) is the i-th. row vector of Hq 1 ^). 

A few comments are in order. First, since the mixing matrix Hq{uj) is the 
FFT of a real matrix, we impose that Hq(uj) = Hq(1 — uj)*. So, supposing T 
is even, we only need to apply JADE to obtain Hq(uj) for uj = 0, 1/T, 1/2; 
Hq(0) and Hq(1/2) will automatically be real. When fixing the freedom of 
scaling in each u>, we choose A(0) and A(l/2) real, and A(u>) = A(l — uj)* for 
other u. Second, to fix the overall scaling and render the solution nontrivial, 
we set A(0) = 1. Third, the weighted least square problem (|2.13p can be 
solved by a direct method or matrix inversion (chapter 6 in [9]). 

Note that when n = 2, among the 2(T — q) equations from (|2.12p with 
r = q,...,T — 1, there are T — 1 variables including Aj(l/2), the real and 
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imaginary parts of \i(tv) for oj = 1/T, ...,1/2 — 1/T. So, we can make roughly 
half of hf ] (T) 0, the best one can achieve in general. Separated signals, 
denoted by s^°'(t), are then produced, for t £ [0,ny], t the frame index. 

The last step IV is to update (r) when Sut <C many new frames 
of mixtures arrive. The steps I to III are repeated using frames from 5tit + 1 
to 5riT + tlt, to generate a new time domain demixing matrix ^'(t), t £ 
[0, T — 1], and separated signal sW(r), t G [T(riT — Any) + 1, T(ut + <5riT)] 
with T the size of one frame. We use r here instead of t because in the 
most part of the paper, t is the frame index. Now, sW(t) and s'°)(t) share 
a common interval of size TAray. On this common interval, s^\t) and 
s(°)(t) will be the same if we are doing a perfect job and if the ordering 
of is consistent with that of s^°\ In order to determine the ordering 
of s^(t), we compute p (^s[°\t), s^\t — k)j on this common interval with 
different k and i, j = 1, ...,n. Then we determine the permutation a of the 
components of s^(t) by minimization: 



a = argmax > max 



(4°\r),s% } (r-k))\ (2.14) 



with some constant K\. After doing the necessary permutation of the 
separated signals are then extended to the extra frames Sut+tit by concate- 
nating the newly separated 5ht many frames of with those of . The 
continuity of concatenation is maintained by requiring that max T |/i^(t)|'s 
(i = 1, 2, • • • , n) are invariant in k, where k = 1, 2, • • •, labels the updated 
filter matrix in time. The procedure repeats with the next arrival of mixture 
data, and is a direct method incorporating dynamic information. 

Because sorting order depends only on the relative values of channel correla- 
tions, we observed in practice that the max fce {_^ K y in equations (|2.10p . 

(|2.1ip . (|2.14p may be replaced by J2k=-K > w hh a different choice of K, 
value. The max^, e |_^ ^ j is a more accurate characterization however. 



2.3 Adaptive Estimation and Cost Reductions 

Cumulants and moments are symmetric functions in their arguments |15| . 
For example when n = 2, there are 16 joint fourth order cumulants from 
(I2.5h . however, only six of them need to be computed, the others follow from 



S 



symmetry. Specifically, among the 16 cumulants: 



OU) 
0(3) 

0(5) 
0(7) 
0(9) 

Q(n) 

0(13) 
Q(15) 



: Cum(yi, 
: Cum(yi, 
: Cum(yi, 
: Cum(yi, 
: Cum(y 2 , 
: Cum(y 2 , 
: Cum(y 2 , 
: Cum(y 2 , 



2/1, 2/T: 

2/1,2/2: 

2/1, 2/l: 
2/2,2/2: 
2/1, 2/1: 
2/1,2/2: 
2/2,2/1: 

2/2,2/2.- 



2/1 J 
2/1 ) 
2/1 ) 

2/1) 

2/1 ) 
2/1 ) 
2/1 ) 
2/1 ) 



0(2) 
0(4) 
0(6) 
0(8) 
0(10) 
0(12) 
0(14) 
0(16) 



: Cum(yi, 
■■ Cum(yi, 

■ Cum(yi, 

■ Cum(yi, 
: Cum(y 2 , 

: Cum(y 2 , 
■- Cum(y 2 , 

■ Cum(y 2 , 



2/i, 2/i, 2/2 j 
2/1,2/5,2/2) 
2/2,2/1,2/2) 
2/2,2/2,2/2) 
2/1,2/1,2/2) 
2/1,2/2,2/2) 
2/2,2/1,2/2) 
2/2,2/1,2/2) 



we have the relations: 0(2) = 0(3)* = 0(5)* = 0(9), 0(4) = 0(6) = 
0(H) = Q(13), 0(7) = 0(10)*, 0(8) = 0(15) = 0(12)* = 0(14)*, where * 
is complex conjugate. For N samples, we only need to compute the following 
six lxJV vectors 



Yi = (2/12/1,. 
Yz = {y\yl. 
Y 5 = (ylyl*,.. 



-,2/W), 
,2/W*), 



I2 



Y A 



= (2/12/2, • 
(2/i 1 2/i*,- 



-,2/iV), 
,2/W*), 

,lW). 



then all the 4th order and 2nd order statistical quantities can be recon- 
structed. For example, 

1 



0(1) 



N 



Tr 



1 

'N2 



(2 sum(y 4 ) sum(Ki) + sum(Yi) (sum(Yi)*)) (2.15) 



where sum(li) is the summation of the N components of Yi. 

As formula (|2.5p suggests, cumulants are updated through moments 
when Sut early samples are replaced by the same number of new samples. 
As 5tit is much less than the total number of terms ny in the empirical 
estimator of expectation, the adjustment costs 25ut flops for each second 
moments and 65riT flops for each joint fourth order moment. The con- 
tributions of the early samples are subtracted from the second and fourth 
moments, then the contributions of the new samples are added. The cu- 
mulant update approach is similar to cumulant tracking method of moving 
targets ( |12j and references therein). 

Due to dynamical cumulants update, the prewhitening step at each fre- 
quency is performed after cumulants are computed from X(uj). This is dif- 
ferent from JADE [6] where the prewhitening occurs before computing the 
commulants. This way, it is more convenient to make use of the previous 
cumulant information and updated X(u). Afterward, we use the multilin- 
earity of the cummulants to transform them back to the commulants of the 
prewhitened X(uj), before joint diagonalization. 
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It is desirable to decrease ny to lower the number of samples for cu- 
mulants estimation. However, this tends to increase the variance in the 
estimated cumulants, and render estimation less stable in time. Numeri- 
cal experiments indicated that with ny as low as 40, the separation using 
overlapping frames is still reliable with reasonable quality. 

It is known [8J that the identity of a speaker is carried by pitch (per- 
ception of the fundamental frequency in speech production) which varies in 
the low frequency range of a few hundred Hertz. We found that instead of 
searching among all frequencies for the reference frequency U\ in step II, it is 
often sufficient to search in the low frequency range. The smaller searching 
range alleviates the workload in sorting and permutation correcting. This 
is similar to a feature oriented method, see |16|, [TBI E] among others. 

2.4 Experimental Results 

The proposed algorithm with adaptivity and cost reduction considerations 
was implemented in Matlab. The original code of JADE by J.-F. Cardoso 
is obtained from a open source (http://web.media.mit.edu/~paris/ ) main- 
tained by P. Smaragdis. Separation results with both dynamic and batch 
processing of three different types of mixtures are reported here: 

(1) real room recorded data; 

(2) synthetic mixture of speech and music; 

(3) synthetic mixture of speech and speech noise. 

They will be called case (1), (2) and (3) in the following discussion. 

The values of the parameters used in the three cases are listed in Table [TJ 
In the table, "n-r (dyn.)" is the initial value of ut in dynamic processing 
and "rtT (bat.)" is the tit in batch processing. Other than n*r, dynamic and 
batch process share the same parameters. The frame size is T, "overlap" 
is the overlapping percentage between two successive frames, 5nj< and Any 
are as in step IV, Kq and K\ are from (12. lip and (|2.14p . (3 is in (12.131) . and 
q is the lower limit of r in (I2.12|) . 

Note that the values of ujl and lujj from (|2.10p are not listed in the table. 
In our computation, we use the following two choices 

(A) uo L = 0,uu = 1/2. 

(B) lol = oju = 4/T, namely fixing reference frequency u\ = 4/T. 
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For the three cases reported in this paper, both choices work and generate 
very similar results. As a consequence, we will only plot the results of the 
first choice. The first choice is more general while the second is motivated 
by the pitch range of speech signal and is computationally more favorable. 
However, we do not know precisely the robustness of the latter. 



case 


T 


overlap 


riT 

(dyn.) 


5ut 


Anj- 


K 


K x 





Q 


(bat.) 


(1) 


512 


0% 


100 


20 


30 


4 


10 


1.04 


2 


200 


(2) 


256 


50% 


100 


20 


40 


15 


20 


1.04 


2 


160 


(3) 


256 


50% 


100 


20 


40 


10 


20 


1.04 


2 


160 



Table 1: Parameters used in both dynamic and batch processing. 

For a quantitative measure of separation in all three cases, we compute 
the maximal correlation coefficient over multiple time lags: 

p(a, b) = max \p(a(r), b(r + k))\ (2.16) 

ke{-K2,---,K2} 

with p defined in (j2.9f) . The p is computed for the mixtures, the sources and 
the separated signals for both batch and dynamic processing. An exception 
is the lack of sources in case (1). We choose K% = 20 in all the computations. 
The results are listed in Table[2]which shows that the p values of the mixtures 
are much larger than those of the dynamically separated signals, which are 
on the same order as the p values of the batch separated signals. In the 
synthetic cases (2) and (3), the p values of the batch separated signals are 
on the same order of the p values of the source signals or 10 -2 . In cases (2) 
and (3), we use the ratio p(x, si)/p(x, S2) to measure the relative closeness 
of a signal x to source signals si and S2- Table [3] lists these ratios for x 
being the separated signals by dynamic and batch methods with A and B 
denoting the two ways of setting the reference frequency uj\ . The outcomes 
are similar no matter x = s\ or x = s~2 (first or second separated signal) 
in either dynamic or batch cases and either way of selecting the reference 
frequency uj\ . 

In case (1), the recorded data [13] consists of 2 mixtures of a piece of mu- 
sic (source 1) and a digit (one to ten) counting sentence (source 2) recorded 
in a normal office size room. The sampling frequency is 16 kHz, and 100 
k data points are shown in Fig. 1. The signals last a little over 6 seconds. 
The result of dynamic BSS algorithm is shown in Fig 2. As a comparison, 
we show in Fig. 3 result of batch processing of steps I to III of the algorithm 
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p(-, •) of 3 cases 


mixture 


dyn. separation 


bat. separation 


sources 


(1)-A 


0.8230 


0.0269 


0.0160 


N/A 


(1)-B 


0.8230 


0.0225 


0.0159 


N/A 


(2)-A 


0.6240 


0.0503 


0.0673 


0.0201 


(2)-B 


0.6240 


0.0182 


0.0600 


0.0201 


(3)-A 


0.4613 


0.0351 


0.0378 


0.0243 


(3)-B 


0.4613 


0.0267 


0.0677 


0.0243 



Table 2: Values of the correlation coefficient p(x,y), (x,y) being either the 
two mixtures or the two sources or the two separated signals by dynamic and 
batch methods. The A and B in the first column denote the two different 
ways of selecting the reference frequency uj\. 



p(x,s 1 )/p(x,s 2 ) 


case(2) 


case(3) 


x= dyn. si(A) 
x= dyn. §2(A) 


4.5899 
0.1086 


4.5096 
0.2852 


x= dyn. si(-B) 
x= dyn. S2(B) 


5.3083 
0.0494 


5.8411 
0.2799 


x= bat. si{A) 
x= bat. 82(A) 


15.0912 
0.0760 


1.4632 
0.1665 


x= bat. s~i(B) 
x= bat. s~2{B) 


6.2227 
0.0636 


25.8122 
0.1719 



Table 3: Ratios of p(x,si) and p(x,S2), x being a separated signal on the 
first column by dynamic or batch method, s\ and S2 are source signals. The 
ratio measures the relative closeness of a; to s\ and S2- If the ratio is larger 
(smaller) than one, x is closer to si (52)- The A and B in the first column 
denote the two different ways of selecting u\. 
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with tit = 200. The batch processing gives a clear separation upon listen- 
ing to the separated signals. The dynamic processing is comparable. The 
filter coefficients in the time domain /ijj(r) at the last update of dynamic 
processing are shown in Fig. 4. Due to weighted least square optimization 
in step III, they are localized and oscillatory with support length Q close to 
half of the FFT size T. 

For cases (2) and (3), we show the envelopes of the absolute values of the 
mixtures or the separated signals. The signal envelope was computed using 
the standard procedure of amplitude demodulation, i.e., lowpass filtering 
the rectified signal. The filter was an FIR filter with 400 taps and the 
cutoff frequency was 100 Hz. Signal envelopes help to visualize and compare 
source and processed signals. We have normalized all the envelopes so that 
the maximum height is 1. The values of a^- in (12. ID . which are used to 
synthetically generate the mixtures, are shown in Fig. [5] (see (19^ (8)]). Fig. [6] 
and Fig. [7] show the mixtures and separated signals of case (2). Fig. [8] and 
Fig. [9] show the mixtures and separated signals of case (3) . In view of these 
plots, Table 2 and Table 3, separation is quite satisfactory, which is also 
confirmed by hearing the separated signals. 

The processing time in MATLAB on a laptop can be a factor of 5 to 
8 above the real time signal duration, however, the time is expected to 
be closer to real time with the computation is executed by Fortran or C 
directly or with additional cost reduction techniques. A breakdown of time 
consumption in the algorithm shows that 40% of the processing is spent on 
computing cumulants, 30 % on sorting in frequency and time domains, 15% 
on fixing scaling functions, 3% on joint diagonalization, the rest on other 
operations such as computing lower order statistics, FFT, IFFT etc. 

3 Conclusions 

A dynamic blind source separation algorithm is proposed to track the time 
dependence of signal statistics and to be adaptive to the potentially time 
varying environment. Besides an efficient updating of cumulants, the method 
made precise the procedure of sorting permutation indeterminacy in the fre- 
quency domain by optimizing a metric (the I 1 x l°° norm) on multiple time 
lagged channel correlation coefficients. A direct and efficient weighted least 
square approach is introduced to compactify the support of demixing fil- 
ter to improve the accuracy of frequency domain localization of convolutive 
mixtures. Experimental results show robust and satisfactory separation of 
real recorded data and synthetic mixtures. An interesting line of future work 
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will be concerned with various strategies to reduce computational costs. 
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Figure Captions 



Fig 1: Case (1), two recorded signals in a real room where a speaker was 
counting ten digits with music playing in the background. 

Fig 2: Case (1) with choice A, separated digit counting sentence (bottom) 
and background music (top) by the proposed dynamic method. Choice B 
gives similar results. 

Fig 3: Case (1) with choice A, separated digit counting sentence (bottom) 
and background music (top) by batch processing using the proposed steps I 
to III. Choice B gives similar results. 

Fig 4: Case (1) with choice A, the localized and oscillatory filter coefficients 
in the time domain at the last frame of dynamic processing. Choice B gives 
similar results. 

Fig 5: The weights Ojj used in generating synthetic mixtures of cases (2) 
and (3), as proposed in [19J. 

Fig 6: Case (2), the synthetic mixtures are generated by a female voice and 
a piece of instrumental music. 

Fig 7: Case (2) with choice A, the envelopes of the separated signals from 
mixtures whose envelopes are in Fig. ©. The small amplitude portion of 
the music is well recovered. Choice B gives similar results. 

Fig 8: Case (3), the synthetic mixtures of a female voice and a speech noise 
with signal to noise ratio equal to —3.8206 dB. The x\ plot shows a speech in 
a strong noise, the valley structures in the speech signal are filled by noise. 

Fig 9: Case (3) with choice A, the envelopes of the separated signals, noise 
(top) and speech (bottom). The envelopes of the two mixtures are in Fig.[8l 
The strongly noisy x\ in Fig. [8] has been cleaned, the valleys in the envelope 
re-appeared. Choice B gives an even better result. 
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Recorded mixture signals in a real room 
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Figure 1: 



Separated source of background music by proposed dynamic method 
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Figure 2: 
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Separated source of background music by the batch method 
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Figure 3: 
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envelope 




Figure 7: 




Figure 8: 
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Figure 9: 
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